Abstract

The problem of autonomous navigation is one of the basic
problems in robotics. In general, however, it can be challenging
when an autonomous vehicle is placed into a partially observable
domain. In this paper we consider a simplistic environment model
and introduce a navigation algorithm based on a Learning
Classifier System.


Introduction 

The problem of navigation in a partially observable domain
may be computationally challenging, since the state space of an
environment in general grows exponentially with the size of
the environment. Current studies mostly suggest the use of a
Partially Observable Markov Decision Process (POMDP), for example
(Koenig and Simmons 1998), (Simmons and Koenig 1995),
(Cassandra, Kaelbling, and Littman 1994). However, POMDP
usually implies computational challenges (Papadimitriou and
Tsitsiklis 1987), which make direct application quite difficult.
To avoid these problems a number of techniques are used,
such as division of the domain state space (Dean et al. 1993),
hierarchical POMDP (Foka and Trahanias 2002), etc.

Alternative approaches also exist, including fuzzy
logic (Pratihar, Deb, and Ghosh 1999) and ‘bug’ algorithms
(Zohaib et al. 2013). It has been recently shown that reactive
navigation models can be successfully trained for the problem
of obstacle avoidance (Ram et al. 1994), (Whitbrook,
Aickelin, and Garibaldi 2008). In this paper we continue
the genetic approach to train models for navigation in partially
observable domains.

In contrast to the works mentioned above, and to avoid unnecessary
complications, we consider a simplistic model of the domain. An
autonomous robot is placed into a cellular two-dimensional
static environment with fixed width and height (W and H),
each cell of which is either occupied or free:

$$U \subseteq \{-1, 1\}^{W \times H}$$

where U is a predefined set of possible domains; -1 corresponds
to the free state of a cell and 1 to the occupied state.

The robot is allowed to occupy exactly one free cell
(which determines its position), has one of four directions
and can execute 3 commands: go forward (according to its
direction) and change its direction by turning left or right.
These commands form the set of possible actions A. The robot’s
goal is to find a sequence of commands to reach a predefined
cell with coordinates $(x_g, y_g)$.

The robot has only one sensor, vision, which reveals the state of
cells in a circle of fixed radius around its position and adds
an unknown cell state (encoded as 0):

$$V = \{-1, 0, 1\}^{W \times H}$$

where V is the set of possible visual observations.

If the vision function is $v^r$, where r is the vision radius, then
the observation from position $(x_0, y_0)$ is:

$$v^r(x_0, y_0) = v \in V; \quad v(x, y) = \begin{cases} u(x, y), & \text{if } (x_0 - x)^2 + (y_0 - y)^2 \le r^2; \\ 0, & \text{otherwise} \end{cases}$$

where $u \in U$ corresponds to the current domain.

To avoid unnecessary quality loss the robot receives an
accumulated map of the domain on which all observations are
superimposed, which can also be expressed by an element of V.
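
The vision function and the accumulated map can be illustrated with a
minimal sketch. The names `observe` and `accumulate`, the `grid[y][x]`
layout and the inclusive circle boundary are assumptions made for this
example only; the text fixes only the cell encoding (-1, 0, 1).

```python
from typing import List

FREE, UNKNOWN, OCCUPIED = -1, 0, 1  # cell encoding used in the text

Grid = List[List[int]]  # assumed layout: grid[y][x], H rows by W columns


def observe(domain: Grid, x0: int, y0: int, r: int) -> Grid:
    """Visual observation from (x0, y0): true cell states inside the
    vision circle, UNKNOWN everywhere else."""
    h, w = len(domain), len(domain[0])
    view = [[UNKNOWN] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Assumption: the boundary of the circle is visible.
            if (x0 - x) ** 2 + (y0 - y) ** 2 <= r * r:
                view[y][x] = domain[y][x]
    return view


def accumulate(known: Grid, view: Grid) -> Grid:
    """Superimpose a new observation on the accumulated map: a cell
    stays known once it has been seen."""
    return [
        [v if v != UNKNOWN else k for k, v in zip(known_row, view_row)]
        for known_row, view_row in zip(known, view)
    ]
```

Since the environment is static, a cell never changes state once
observed, so the accumulated map only ever gains information.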

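The set of the robot’s observations O consists of the current
vision V, position $\mathbb{N}^2$, direction
$D = \{(x, y) \mid x, y \in \{-1, 0, 1\}, |x| + |y| = 1\}$ and goal
$\mathbb{N}^2$:

$$O = V \times \mathbb{N}^2 \times D \times \mathbb{N}^2 \qquad (1)$$

We define the robot’s navigation policy $\pi$ as a function $\psi$,
some state space $S^\pi$ and an initial state $s_0 \in S^\pi$:

$$\psi(\omega^t, s^t) = (s^{t+1}, a) \qquad (2)$$

where t is the number of the current step, $\omega^t \in O$ is the
current observation, $s^t, s^{t+1} \in S^\pi$ are the policy states
before and after making the decision, and $a \in A$ is the
resulting action.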

By executing policy $\pi$ on each step the robot forms its path
p in the domain $u \in U$. The quality of a policy in a particular
domain u is measured by the following cost function:

$$C_u(p) = \frac{|p|}{|p_u^*|} \qquad (3)$$

where $p_u^*$ is the shortest path from the starting point to the
goal in domain u.
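
As an illustration of this cost, the sketch below computes the
shortest path length with a breadth-first search and divides the
travelled length by it. It assumes that path length counts moves
between 4-neighbouring cells and ignores turns; the text does not state
how turns are counted. The function names are hypothetical.

```python
from collections import deque
from typing import List, Optional, Tuple

FREE, OCCUPIED = -1, 1
Cell = Tuple[int, int]  # (x, y)


def shortest_path_length(domain: List[List[int]], start: Cell,
                         goal: Cell) -> Optional[int]:
    """Breadth-first search over free cells with 4-neighbour moves.
    Returns the number of moves, or None if the goal is unreachable."""
    h, w = len(domain), len(domain[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            return dist[(x, y)]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if (0 <= nx < w and 0 <= ny < h
                    and domain[ny][nx] == FREE and (nx, ny) not in dist):
                dist[(nx, ny)] = dist[(x, y)] + 1
                queue.append((nx, ny))
    return None


def path_cost(path: List[Cell], domain: List[List[int]],
              start: Cell, goal: Cell) -> float:
    """Ratio of the travelled path length to the shortest path length."""
    optimal = shortest_path_length(domain, start, goal)
    if not optimal:
        raise ValueError("goal unreachable or identical to start")
    return (len(path) - 1) / optimal
```

A cost of 1 means the robot followed a shortest path; any detour
pushes the ratio above 1.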

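It can be shown that a Pareto optimal policy $\pi^*$ either
does not exist or reaches the cost limit:

$$\left[\forall \pi,\ \forall u \in U:\ C_u(p_u^{\pi^*}) \le C_u(p_u^{\pi})\right] \Rightarrow \forall u \in U:\ C_u(p_u^{\pi^*}) = 1$$

where $p_u^\pi$ is the path produced by policy $\pi$ in domain u.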

But since the latter case, in which a Pareto optimal policy
exists, is not the general case, following the POMDP approach we
introduce a probability space $\Omega = (U, 2^U, P)$, where $P(u)$
corresponds to the prior probability of the robot being placed in
the domain $u \in U$, and the corresponding cost function^1:

$$C(\pi) = \mathbb{E}_u \frac{|p_u^\pi|}{|p_u^*|} \qquad (4)$$

Note that here the expectation should be taken over the support
of $P(u)$: policies are not required to find a path in domains
outside the support of $P(u)$, where the cost becomes undefined.

Now we can define the navigation problem considered in
this paper: for a given $\Omega$ find a policy $\pi^*$ (or its
approximation) that minimizes the cost function (4):

$$\pi^* = \arg\min_\pi \mathbb{E}_u \frac{|p_u^\pi|}{|p_u^*|} \qquad (5)$$

Note that the prior probability $P(u)$ implicitly defines the
structure of the space of possible domains, or their weights.
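
The expected cost and the resulting selection problem can be sketched
as follows. This is only an illustration over a finite list of
candidate policies; `cost(policy, u)` is a hypothetical per-domain cost
function, and the prior is assumed to be given as a dictionary.

```python
from typing import Callable, Dict, Hashable, Sequence


def expected_cost(cost_in_domain: Callable[[Hashable], float],
                  prior: Dict[Hashable, float]) -> float:
    """Expectation of the per-domain cost, taken only over the support
    of the prior (domains with P(u) > 0)."""
    support = {u: p for u, p in prior.items() if p > 0}
    total = sum(support.values())
    return sum(p * cost_in_domain(u) for u, p in support.items()) / total


def best_policy(policies: Sequence, cost: Callable, prior: Dict) -> object:
    """Candidate with the lowest expected cost.

    cost(policy, u) is a hypothetical function returning the cost of
    the path the policy produces in domain u."""
    return min(
        policies,
        key=lambda pi: expected_cost(lambda u: cost(pi, u), prior),
    )
```

Domains with zero prior probability are never evaluated, which matches
the remark that the cost may be undefined outside the support.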


^1 As with cost functions in POMDP, a number of other functions
can be used instead of this one, as long as for each domain u they
increase monotonically with $|p_u^\pi|$.


Policy model

The policy definition (2) is a very general one since it
contains a great class of algorithms. However, optimization in
such an algorithmic space can be problematic. Fortunately, it is
possible to narrow this space to mappings from ‘extended’
observations to actions, i.e. to stateless algorithms.

Firstly, only algorithms able to find a path in all possible
domains are worth consideration. Since the numbers of possible
observations, robot positions and directions are finite, only
a finite number of policy states is used. An algorithm is either
effectively stateless (can be represented through a mapping)
or Pareto dominated by some mapping:

$$\psi'(\omega, s') = (\psi(\omega, s(\omega)), s')$$

where $s(\omega)$ is one of the states possible for $\omega$ that
minimizes the cost, and $s'$ is the only state of the modified
policy.

Note that the definition of $\omega$ as the accumulated vision
observation is essential for these considerations. For
convenience we will define the full observation as the tuple
{x^ ,yy ,(P ,y^) and will refer to it simply as the observation
in the text below. Correspondingly we extend the observation
set O.

Following the approach of (Whitbrook, Aickelin, and
Garibaldi 2008) we define a navigation policy as a set of genes
G, where each gene from G is a condition-action pair (c, a)
with $c : O \to [-1, 1]$ and $a \in A$. Conditions c can be
viewed as predicates of fuzzy logic with the operators described
below, where -1 corresponds to ‘false’, 1 to ‘true’ and 0 to
maximal uncertainty.

The final action can be formed in different ways; we consider
only two of them:

$$\pi_1(\omega) = \arg\max_{g = (c, a) \in G} w(g)\, c(\omega) \qquad (6)$$

$$\pi_2(\omega) = \arg\max_{a \in A} \frac{\sum_{g = (c, a')} w^2(g)\, c(\omega)\, I[a' = a \wedge c(\omega) > 0]}{\sum_{g = (c, a')} w(g)\, I[a' = a \wedge c(\omega) > 0]} \qquad (7)$$

where:

$$I[P] = \begin{cases} 1, & \text{if } P; \\ 0, & \text{otherwise} \end{cases}$$

and $w(g) > 0$ is the weight of gene g, which will be used later
for reinforcement learning. In (6) the action of the maximizing
gene is taken. We will refer to $\pi_1$ and $\pi_2$ as fusion rules.

Conditions are formed from a predefined finite set of
predicates B. Each predicate from B, like a condition, is a
function $O \to [-1, 1]$. Let us order the predicates of B into a
vector of basic predicates b; if the size of B is n and
$\alpha \in \mathbb{R}^n$, then the convolution operator is defined as:

$$(\alpha \odot b)(\omega) = \frac{\sum_{i=1}^{n} \alpha_i b_i(\omega)}{\|\alpha\|_1} \qquad (8)$$

where $\|\alpha\|_1$ is the $l_1$ norm.

The set B defines the genetic model $\mathcal{G}$:

$$\mathcal{G} = \{(\alpha \odot b, a) \mid \alpha \in \mathbb{R}^n, a \in A\}$$

Hence within a genetic model each gene can be encoded by a
real vector and an action.
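
The gene encoding and the two fusion rules can be sketched as below.
This is a minimal illustration under stated assumptions: the class and
function names are hypothetical, an all-zero coefficient vector is
treated as maximal uncertainty, and the second rule is read as a
per-action vote with a squared weight in the numerator, which is one
reading of the formula rather than a confirmed one.

```python
from typing import Callable, Sequence

Predicate = Callable[[object], float]  # maps an observation to [-1, 1]


class Gene:
    """Condition-action pair whose condition is an l1-normalized linear
    combination of basic predicates."""

    def __init__(self, alpha: Sequence[float], action: str,
                 weight: float = 1.0):
        self.alpha, self.action, self.weight = list(alpha), action, weight

    def condition(self, basis: Sequence[Predicate], obs) -> float:
        norm = sum(abs(a) for a in self.alpha)
        if norm == 0:
            return 0.0  # assumption: all-zero gene means "uncertain"
        return sum(a * b(obs) for a, b in zip(self.alpha, basis)) / norm


def fuse_best_gene(genes, basis, obs) -> str:
    """First rule: action of the gene with the largest w(g) * c(obs)."""
    best = max(genes, key=lambda g: g.weight * g.condition(basis, obs))
    return best.action


def fuse_vote(genes, basis, obs, actions) -> str:
    """Second rule (assumed form): genes with a positive condition vote
    for their own action; the best-scoring action wins."""
    def score(action: str) -> float:
        num = den = 0.0
        for g in genes:
            c = g.condition(basis, obs)
            if g.action == action and c > 0:
                num += g.weight ** 2 * c  # squared weight is an assumption
                den += g.weight
        # Actions with no voters can never win against a voted action.
        return num / den if den > 0 else float("-inf")
    return max(actions, key=score)
```

Because the condition is normalized by the l1 norm, its value stays in
[-1, 1] whenever every basic predicate does.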

The most basic set of predicates $B_0$ consists of elementary
vision predicates, position predicates, direction predicates
and goal predicates. If $\omega = (v, p, d, g)$ then these
predicates can be expressed as:

$$b^x_{x'}(\omega) = \frac{p_x - x'}{W}, \qquad b^y_{y'}(\omega) = \frac{p_y - y'}{H}, \qquad b^d_{d'}(\omega) = d \cdot d'$$

where $d \cdot d'$ is the scalar product. The goal predicates are
similar to the position predicates $b^x$, $b^y$.

It is easy to see that for each observation $\omega$ the model
$\mathcal{G}_0$ contains a condition $c_\omega$ that distinguishes
$\omega$:

$$\omega = \arg\max_{\omega' \in O} c_\omega(\omega')$$

which means that, using fusion rule $\pi_1$, any policy can be
represented with some $G \subset \mathcal{G}_0$. However, the same is
not always possible for $\pi_2$.^2

The main informal assumption for this model is that, in
spite of the fact that in general
$2^{W \times H} \times (W \times H)^2 \times 4$ genes are required to
express some policy, the efficient policies contain large
clusters of rules that can be efficiently expressed by one gene,
reducing the size of G to a computationally acceptable number,
and also that the policies built with fusion rule $\pi_2$ include
efficient ones.

Unfortunately, experiment shows that this assumption may not
hold for the model $\mathcal{G}_0$: none of the obtained policies
formed by $\mathcal{G}_0$ was able to find a solution in even a single
domain. The problem lies in the initial guess for the search: it
is very unlikely to randomly generate an acceptable policy to
start the search with, and, being unable to evaluate policies,
search algorithms degenerate into random search.

^2 However, it is quite easy to build a model that makes every
policy expressible in this model with fusion rule $\pi_2$.

To solve this problem we extend the model $\mathcal{G}_0$ with
heuristics, which allow efficient algorithms to be encoded by
smaller sets of genes, at least for some domain probability
spaces $\Omega$.

The model can be extended by custom predicates
$O \to [-1, 1]$, for example, the Euclidean distance from the
current position p to the goal g:


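$$b(\omega) = 1 - 2\,\frac{\|p - g\|_2}{d_{\max}} \qquad (9)$$

where $\|\cdot\|_2$ is the Euclidean norm and $d_{\max}$ stands for
the normalizing denominator, which must be at least the largest
possible distance in the domain for the value to stay in
$[-1, 1]$.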

It is also noticeable that ‘predictive’ predicates improve
the quality of the resulting policies, for example, ‘will the
forward command increase the real distance to the goal’ with
possible values 1 (it is strictly guaranteed), 0 (unknown) and
-1 (the opposite of 1).

But the major changes in the structure of optimal policies are
introduced by complete navigation heuristics, obtained from
other policies, which might not be optimal but are able to find
a path to the goal point in every domain. If a complete stateless
policy is denoted as $\pi_0 : O \to A$, it produces $|A|$
predicates, one for each $a \in A$:

$$b_a(\omega) = 2 I[\pi_0(\omega) = a] - 1 \qquad (10)$$

In the experiment we chose the ‘naive’ navigation algorithm
as such a heuristic. It follows the shortest path from the
current point to the goal, considering unknown points as
free. On the one hand, it is computationally cheap and requires
$O(H \times W)$ operations or fewer, depending on the domain
structure and the implementation of the algorithm. On the other
hand, it is a good approximation of optimal policies for domain
probability spaces with wide support and high entropy (i.e.
under wide assumptions).
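
A possible implementation of such a ‘naive’ step and of the predicates
it induces is sketched below. Several details are assumptions made for
the example: the map layout `known[y][x]` with y growing downward, the
meaning of ‘left’ and ‘right’ under that layout, the tie-breaking order
(forward, then left, then right), and the fallback action when no route
is known.

```python
from collections import deque
from typing import Dict, List, Tuple

FREE, UNKNOWN, OCCUPIED = -1, 0, 1
Cell = Tuple[int, int]  # (x, y)


def distances_to(goal: Cell, known: List[List[int]]) -> Dict[Cell, int]:
    """BFS from the goal over cells not known to be occupied
    (unknown cells are optimistically treated as free)."""
    h, w = len(known), len(known[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        x, y = queue.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if (0 <= nx < w and 0 <= ny < h
                    and known[ny][nx] != OCCUPIED and (nx, ny) not in dist):
                dist[(nx, ny)] = dist[(x, y)] + 1
                queue.append((nx, ny))
    return dist


def naive_action(known, pos: Cell, direction: Cell, goal: Cell) -> str:
    """One step of the naive policy: 'forward', 'left' or 'right'."""
    dist = distances_to(goal, known)
    here = dist.get(pos)
    if here is None or here == 0:
        return "forward"  # assumption: arbitrary action at goal / no route
    x, y = pos
    dx, dy = direction

    def improves(d: Cell) -> bool:
        # Does a step in direction d shorten the estimated route?
        return dist.get((x + d[0], y + d[1]), here) < here

    if improves((dx, dy)):
        return "forward"
    if improves((dy, -dx)):  # left turn when y grows downward
        return "left"
    return "right"  # covers both "right" and "behind"


def heuristic_predicates(known, pos, direction, goal,
                         actions=("forward", "left", "right")):
    """One predicate per action: 1 if the heuristic picks it, else -1."""
    chosen = naive_action(known, pos, direction, goal)
    return {a: 2 * int(chosen == a) - 1 for a in actions}
```

The search visits each cell at most once, which is consistent with the
stated bound on the number of operations.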

Since the model now contains a complete policy that can be
expressed with only 3 genes (which distinguish the 3 possible
outcomes of the heuristic), other rules are either useless,
since they cover special cases of observation that are already
covered by these 3 basic genes, or play the role of exceptions
from this complete policy. This makes more policies evaluable,
which allows the search to proceed without degenerating into
random guessing.


Policy learning 

Since the model fits naturally into the terms of genetic
algorithms, and due to the high dimensionality, we use a
simplification of the Learning Classifier System (Farmer,
Packard, and Perelson 1986) similar to (Whitbrook, Aickelin,
and Garibaldi 2008).

The standard genetic procedures for real vectors, such as
generation, mutation and crossover, are used.

Selection is performed by changing the weight $w(g)$ of gene
g during the learning phase. Genes with negative or low weight^3
are removed from the set G. Each performed action is evaluated,
and the reward $R \in \mathbb{R}$ is spread as an increase in weight
over all genes responsible for the performed action.


For the fusion rule $\pi_1$ only the gene g responsible for the
current action receives the penalty or reward:

$$w^t(g) = w^{t-1}(g) + R$$

For the other fusion rule the reward is shared by all genes that
‘voted’ for the performed action. If gene $g = (c, a)$, the
observation on step t was $\omega$, and the performed action was
$a'$, then for each gene from G:

$$w^t(g) = w^{t-1}(g) + \frac{R}{Z(a')}\, c(\omega)\, I[a' = a \wedge c(\omega) > 0]$$

where:

$$Z(a') = \sum_{g \in G,\ g = (c, a)} c(\omega)\, I[a' = a \wedge c(\omega) > 0]$$
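
The two weight updates can be written down as a short sketch. The
function names, the list-based representation of weights and votes, and
the pruning threshold are assumptions introduced for the example.

```python
from typing import List, Tuple


def update_single(weights: List[float], responsible: int,
                  reward: float) -> None:
    """First fusion rule: only the gene that produced the action
    receives the reward (or penalty, if the reward is negative)."""
    weights[responsible] += reward


def update_shared(weights: List[float], votes: List[Tuple[str, float]],
                  performed: str, reward: float) -> None:
    """Second fusion rule: the reward is split among the genes that
    voted for the performed action, in proportion to their condition.

    votes[i] = (action of gene i, condition value of gene i)."""
    shares = [c if (a == performed and c > 0) else 0.0 for a, c in votes]
    z = sum(shares)
    if z == 0:
        return  # no gene voted for this action, nothing to update
    for i, share in enumerate(shares):
        weights[i] += reward * share / z


def prune(weights: List[float], threshold: float = 0.0) -> List[int]:
    """Indices of genes that survive; the threshold is hypothetical."""
    return [i for i, w in enumerate(weights) if w > threshold]
```

With the shared update the increments over all voting genes sum to the
reward R, so the total amount of weight added per step does not depend
on how many genes voted.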


Evaluation of each gene’s performance is built on the decrease
of potential between the states before and after performing an
action: $R = \beta \Delta G$, where the coefficient $\beta$
regulates the speed of learning.

Depending on the way the potential is defined, there are two
types of learning: supervised, when the potential directly
corresponds to the true cost function for the current domain and
so uses information about the domain that is unknown to the
robot, and unsupervised, when some heuristics are applied to
approximate the cost function. For supervised learning, if the
state (position and direction) before performing an action is s
and the one after is s', then:

$$\Delta G = \frac{\rho(s) - \rho(s')}{\rho(s_0)} \qquad (11)$$

where $\rho(s)$ is the length of the shortest path from state s to
the goal point and $s_0$ is the initial state.

For unsupervised learning the expression for the potential is
similar; however, now the distance $\rho$ should be estimated from
the current observation $\omega$ rather than from a known map of
the domain:

$$\Delta G = \frac{\rho_\omega(s) - \rho_\omega(s')}{\rho_\omega(s_0)} \qquad (12)$$

Note that the ‘naive’ policy follows the decrease of the
potential that approximates the distance $\rho$ by considering
unknown cells free, and hence the more cells are known, the more
accurate the estimation becomes. But direct usage of this
approximation makes the ‘naive’ policy the optimal one. Instead,
to make the estimation more accurate, we suggest using
observations from the ‘future’ to form the current estimation by
holding gene evaluation for some number of steps M:

$$\Delta G^t = \frac{\rho_{\omega^{t+M}}(s) - \rho_{\omega^{t+M}}(s')}{\rho_{\omega^{t+M}}(s_0)} \qquad (13)$$

It is reasonable to start with a low M in order to achieve
greater learning speed, gradually increasing M as the policy
becomes stable.

However, even for a big enough M the procedure does not
guarantee the same result as supervised learning. Nevertheless,
it is still able to correct the policy in some situations, for
example, to penalize entering a dead end.
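
The delayed evaluation can be sketched as a small buffer that holds
transitions for M steps and scores them with the distance estimate
available at that later time. The class and function names are
hypothetical, states are reduced to positions for brevity, and the
start is assumed not to coincide with the goal (so the normalizer is
non-zero).

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

State = Tuple[int, int]                # position only, for brevity
DistanceFn = Callable[[State], float]  # estimated distance to the goal


def potential_drop(rho: DistanceFn, s: State, s_next: State,
                   s0: State) -> float:
    """Normalized decrease of the distance estimate after one action.
    Assumes rho(s0) != 0, i.e. the start is not the goal."""
    return (rho(s) - rho(s_next)) / rho(s0)


class DelayedEvaluator:
    """Holds transitions for `delay` steps and scores them with the
    distance estimate available at that later time."""

    def __init__(self, delay: int, s0: State, beta: float = 1.0):
        self.delay, self.s0, self.beta = delay, s0, beta
        self.pending: Deque[Tuple[State, State]] = deque()

    def step(self, s: State, s_next: State,
             rho_now: DistanceFn) -> List[Tuple[State, State, float]]:
        """Register a transition and return (s, s_next, reward) for
        every transition that is now `delay` steps old."""
        self.pending.append((s, s_next))
        ready = []
        while len(self.pending) > self.delay:
            old_s, old_next = self.pending.popleft()
            drop = potential_drop(rho_now, old_s, old_next, self.s0)
            ready.append((old_s, old_next, self.beta * drop))
        return ready
```

With `delay=0` every transition is scored immediately, which
corresponds to the undelayed unsupervised potential; a larger delay lets
later observations (for example, a discovered dead end) turn an
apparently good move into a penalized one.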


^3 For example, the strategy used in the experiment virtually
divides the set G into an active zone and an ‘incubation’ zone,
with the size of the former restricted to N. After each step the
top N genes that have reached a predefined minimal lifetime are
selected by weight.

Experiment 

For the experiment 20 office-like domains with a size of about
100 by 100 cells were selected. These domains form the support
of the probability space $\Omega$ with equal probability of each
domain. Starting and goal points are placed on opposite sides:
the starting point at $y = 0$ and the goal at $y = H - 1$. The
vision radius was set to r = 5. The ‘naive’ policy $\pi_0$ alone
shows $C(\pi_0) \approx 1.8$ and, depending on the particular
domain u, $C_u(p_u^{\pi_0}) \in [1.7, 2.0]$. Since fusion rule
$\pi_2$ simultaneously changes the weights of a group of genes, it
was selected for the experiment. The experiment model extends
$\mathcal{G}_0$ by the ‘naive’ policy, dead-end detection, direct
distance to the goal point, and detection of obstacles on the
right, on the left and ahead of the robot within the vision
radius.

The robot was placed into each of the 20 domains one by one in
random order. The same process was repeated until convergence
was achieved. An iteration over all 20 domains is denoted as a
generation.

Three different potentials were compared: supervised,
unsupervised (denoted as retrospective) with M increasing by 2
each generation starting with M = 0, and the potential of the
‘naive’ policy, denoted as classical (unsupervised with
M = 0).

As the result of the experiment, policies with costs
$C(\pi) \approx 1.4, 1.5$ were obtained. As can be seen from the
figure with the experiment results, unsupervised learning has a
lower convergence speed but the same cost limit, in contrast to
the ‘naive’ policy potential, which converges around the cost of
the ‘naive’ policy, as predicted.

Future work 

As the experiment shows, the obtained policies widely use
heuristic predicates. Possibly, by forming a proper predicate
dictionary, it is achievable to dispense with the model
$\mathcal{G}_0$ entirely, or at least to introduce two types of
genes: standard ones and ones formed exclusively from heuristic
predicates, with the latter in the majority. Such a dictionary
would allow the number of optimization parameters to be reduced
considerably, which may speed up optimization or even allow the
use of classical methods of discrete optimization.

Figure 1: Demonstration of convergence. (a) First generation.
(b) 10th generation. (c) The shortest path. The support of the
domain probability space consists of only one domain. Triangles
represent positions and directions of the robot on its path.

Figure 2: An example of a domain from the training set in the
experiment. The path of the ‘naive’ policy is shown.

Figure 3: Genetic algorithms by feedback. The vertical axis
corresponds to the cost function (steps to optimal ratio), the
horizontal axis to the generation number. The training set
consists of only one domain, shown in a separate figure.

Figure 4: Genetic algorithms by feedback (mean, std). Experiment
results. Bars represent the standard deviation within the 20
selected domains.